## Data Analysis #2

## 'data.frame':    1036 obs. of  10 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ VOLUME: num  28.7 8.1 163.4 12.2 59.7 ...
##  $ RATIO : num  0.15 0.147 0.269 0.185 0.165 ...

#### Section 1:

## Skewness: 0.7147056
## Kurtosis (rockchalk adjusted): 1.667298

## Skewness: -0.09391548
## Kurtosis: 0.5354309

(1)(c) Test the homogeneity of variance across classes using bartlett.test() (Kabacoff Section 9.2.2, p. 222).

## 
##  Bartlett test of homogeneity of variances
## 
## data:  L_RATIO by CLASS
## Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267

Based on the steps in Section 1, L_RATIO conforms better than RATIO to the assumptions of normality and homogeneity of variances across age classes. In 1(a) and 1(b), the histogram of L_RATIO is more symmetric than the right-skewed histogram of RATIO. The Q-Q plot for L_RATIO shows points closely following the diagonal, which indicates a near-normal distribution, whereas RATIO deviates from the line, especially in the tails. The skewness of L_RATIO is -0.0939, close to zero, while the skewness of RATIO is 0.7147, which reflects right-skewness. The kurtosis reported for L_RATIO (0.5354) is also closer to zero than that reported for RATIO (1.6673). Bartlett’s test on L_RATIO fails to reject the null hypothesis of equal variances (p = 0.5267), so there is no evidence that the variances differ across classes. The boxplot also displays relatively consistent spread across age classes, which supports the test result. L_RATIO is therefore the better variable for statistical tests that assume normality and homoscedasticity, since the log transformation improved both the distribution shape and the variance stability across groups.
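The comparison above can be reproduced in outline with base R. The following is a minimal sketch on simulated data, not the abalone dataset: the `toy` data frame, the `skewness()` and `excess_kurtosis()` helpers, and the use of `log10()` for the transformation are all assumptions made for illustration. The helpers are plain moment-based estimators and will not necessarily match the adjusted values printed by `rockchalk`.

```r
# Hypothetical stand-in data: a right-skewed, strictly positive RATIO
# and five age classes. These values are simulated, not the abalone data.
set.seed(1)
toy <- data.frame(
  RATIO = rlnorm(500, meanlog = log(0.14), sdlog = 0.3),
  CLASS = factor(sample(paste0("A", 1:5), 500, replace = TRUE))
)
toy$L_RATIO <- log10(toy$RATIO)  # assumption: base-10 log transform

# Moment-based skewness; 0 for a symmetric distribution.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Moment-based excess kurtosis; 0 for a normal distribution.
excess_kurtosis <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

cat("Skewness RATIO:         ", skewness(toy$RATIO), "\n")
cat("Skewness L_RATIO:       ", skewness(toy$L_RATIO), "\n")
cat("Excess kurtosis RATIO:  ", excess_kurtosis(toy$RATIO), "\n")
cat("Excess kurtosis L_RATIO:", excess_kurtosis(toy$L_RATIO), "\n")

# Bartlett's test: the null hypothesis is equal variances across classes.
bt <- bartlett.test(L_RATIO ~ CLASS, data = toy)
print(bt)
```

A large p-value from `bartlett.test()` means the data give no evidence against equal variances; it does not prove that the variances are equal.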

#### Section 2 ####

##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.370 < 2e-16 ***
## SEX            2  0.091 0.04569   6.644 0.00136 ** 
## CLASS:SEX      8  0.027 0.00334   0.485 0.86709    
## Residuals   1021  7.021 0.00688                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.524 < 2e-16 ***
## SEX            2  0.091 0.04569   6.671 0.00132 ** 
## Residuals   1029  7.047 0.00685                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The two ANOVA models show that both CLASS and SEX have significant effects on L_RATIO, with CLASS being highly significant (p < 2e-16) and SEX also significant (p ≈ 0.001). However, the CLASS:SEX interaction in the first model is not significant (p = 0.867), which means that the effect of CLASS on L_RATIO does not appear to depend on SEX. Because the interaction does not contribute meaningfully to explaining the variance, the second model provides a more efficient representation of the data without losing explanatory power. The combination of CLASS and SEX does not explain any additional variation, which makes the simpler model the better choice.
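The model comparison described above can be sketched as follows. This example uses simulated data with purely additive effects; the `toy` data frame and its effect sizes are hypothetical. The call to `anova()` on the two nested fits gives a partial F test of the interaction term, which is one common way to check whether dropping it is justified.

```r
# Hypothetical data with additive CLASS and SEX effects and no interaction.
set.seed(2)
n <- 600
toy <- data.frame(
  CLASS = factor(sample(paste0("A", 1:5), n, replace = TRUE)),
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
toy$L_RATIO <- -0.80 - 0.02 * as.integer(toy$CLASS) -
  0.015 * (toy$SEX == "I") + rnorm(n, sd = 0.08)

# Model with the interaction and the reduced additive model.
full    <- aov(L_RATIO ~ CLASS * SEX, data = toy)
reduced <- aov(L_RATIO ~ CLASS + SEX, data = toy)

print(summary(full))
print(summary(reduced))

# Partial F test: does adding CLASS:SEX reduce the residual sum of squares
# by more than chance would explain?
cmp <- anova(reduced, full)
print(cmp)
```

With five classes and three sexes, the interaction uses (5 - 1) × (3 - 1) = 8 degrees of freedom, which matches the `Df` shown for `CLASS:SEX` in the first table.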

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L_RATIO ~ CLASS + SEX, data = mydata)
## 
## $CLASS
##              diff         lwr          upr     p adj
## A2-A1 -0.01248831 -0.03876038  0.013783756 0.6919456
## A3-A1 -0.03426008 -0.05933928 -0.009180867 0.0018630
## A4-A1 -0.05863763 -0.08594237 -0.031332896 0.0000001
## A5-A1 -0.09997200 -0.12764430 -0.072299703 0.0000000
## A3-A2 -0.02177176 -0.04106269 -0.002480831 0.0178413
## A4-A2 -0.04614932 -0.06825638 -0.024042262 0.0000002
## A5-A2 -0.08748369 -0.11004316 -0.064924223 0.0000000
## A4-A3 -0.02437756 -0.04505283 -0.003702280 0.0114638
## A5-A3 -0.06571193 -0.08687025 -0.044553605 0.0000000
## A5-A4 -0.04133437 -0.06508845 -0.017580286 0.0000223
## 
## $SEX
##             diff          lwr           upr     p adj
## I-F -0.015890329 -0.031069561 -0.0007110968 0.0376673
## M-F  0.002069057 -0.012585555  0.0167236690 0.9412689
## M-I  0.017959386  0.003340824  0.0325779478 0.0111881

There is a clear declining pattern in L_RATIO as abalones age from A1 to A5. Every pairwise difference is negative, so L_RATIO decreases consistently with age class, and all comparisons are statistically significant at the 0.05 level except A2-A1 (p = 0.692). This implies that older abalones have systematically lower L_RATIO values, which likely mirrors biological changes over time. The strength of these differences reinforces the idea that age plays a crucial role in determining L_RATIO. Regarding SEX, the M-F comparison shows no significant difference (p = 0.941), so male and female L_RATIO values are statistically indistinguishable and the two can reasonably be combined into a single ‘Adult’ category. Infants, however, differ significantly from both females (p = 0.038) and males (p = 0.011), so they should not be combined into the same group as adults.
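For readers who want to see how such a table is produced and filtered, here is a minimal sketch using `TukeyHSD()` on an additive `aov()` fit. The data are simulated and the `toy` data frame is hypothetical; only the structure of the output corresponds to the table above.

```r
# Hypothetical data: L_RATIO declines with CLASS, infants sit slightly lower.
set.seed(3)
n <- 600
toy <- data.frame(
  CLASS = factor(sample(paste0("A", 1:5), n, replace = TRUE)),
  SEX   = factor(sample(c("F", "I", "M"), n, replace = TRUE))
)
toy$L_RATIO <- -0.80 - 0.02 * as.integer(toy$CLASS) -
  0.015 * (toy$SEX == "I") + rnorm(n, sd = 0.08)

fit <- aov(L_RATIO ~ CLASS + SEX, data = toy)
tk  <- TukeyHSD(fit, conf.level = 0.95)

# Each element is a matrix with columns diff, lwr, upr and "p adj".
print(tk$SEX)

# Keep only the pairs whose family-wise adjusted p-value is below 0.05.
sig_class <- tk$CLASS[tk$CLASS[, "p adj"] < 0.05, , drop = FALSE]
print(sig_class)
```

A pair is significant exactly when its interval (`lwr`, `upr`) excludes zero, so the interval columns and the `p adj` column carry the same decision.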

#### Section 3 ####

## 
## ADULT     I 
##   707   329

Most infant volumes in the blue histogram are concentrated between 50 and 300, while most adult volumes in the pink histogram range from 200 to 800. The infant distribution is right-skewed, while the adult distribution is more spread out. Some larger infants have volumes that coincide with those of the smallest adults, which makes the classification less clear at the boundary. The overlap implies that volume alone might not be a perfect classifier for differentiating infants from adults. Misclassification can occur in the overlapping range, where some small adults resemble large infants.

The log transformation improves the linear relationship between SHUCK and VOLUME by stabilizing the spread of the points. In the SHUCK versus VOLUME plot, variability increases with VOLUME and creates a fan-shaped spread, which indicates non-constant variance. After transformation, the L_SHUCK versus L_VOLUME scatterplot shows a more even spread, which is crucial for satisfying the homoscedasticity assumption in regression modeling. CLASS levels remain consistently ordered, and TYPE levels stay distinct, with infants clustered at smaller values and adults spanning a wider range. The log transformation preserves these relationships while improving linearity, which suggests that a regression on the log-transformed variables is more effective for analyzing the SHUCK and VOLUME relationship than a linear model on the original scale.

#### Section 4: ####

## 
## ADULT     I 
##   747   289
## 
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = mydata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.270634 -0.054287  0.000159  0.055986  0.309718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.796418   0.021718 -36.672  < 2e-16 ***
## L_VOLUME     0.999303   0.010262  97.377  < 2e-16 ***
## CLASSA2     -0.018005   0.011005  -1.636 0.102124    
## CLASSA3     -0.047310   0.012474  -3.793 0.000158 ***
## CLASSA4     -0.075782   0.014056  -5.391 8.67e-08 ***
## CLASSA5     -0.117119   0.014131  -8.288 3.56e-16 ***
## TYPEI       -0.021093   0.007688  -2.744 0.006180 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08297 on 1029 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9501 
## F-statistic:  3287 on 6 and 1029 DF,  p-value: < 2.2e-16

The CLASS coefficient estimates show a consistent decrease in L_SHUCK as CLASS increases. The negative coefficients indicate that abalones in higher CLASS levels tend to have lower L_SHUCK values, even when controlling for L_VOLUME and TYPE. We can infer that as abalones age, their shuck weight does not rise at the same rate as their volume. In the earlier scatterplots, A4 and A5 abalones tended to have higher volumes but lower shuck weight relative to their size. The regression model quantifies this association, which indicates that older abalones tend to have a lower shuck-to-volume ratio.

TYPE is a relatively weak predictor of L_SHUCK compared with L_VOLUME and CLASS when it comes to informing harvesting decisions. The coefficient for TYPE (-0.021) implies that infants yield slightly less shuck weight than adults of the same volume and class, yet this effect is small compared with the relationship between L_VOLUME and L_SHUCK and with the larger negative effects of CLASS at higher levels (-0.076 for A4 and -0.117 for A5). Both L_VOLUME and CLASS are far stronger determinants of L_SHUCK than TYPE. While volume and class level are more informative about expected shuck yield, TYPE contributes only a small adjustment to harvest predictions. Although TYPE is statistically significant (p = 0.006), its practical relevance in predicting L_SHUCK is limited relative to the other variables in the model.
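To make the size of these effects concrete, the coefficients from the summary above can be converted from the log scale to approximate percentage changes in shuck weight. The sketch below assumes that L_SHUCK and L_VOLUME are base-10 logarithms, which the output does not state; with natural logs the percentages would be smaller. The variable names are introduced here for illustration only.

```r
# Coefficients copied from the regression summary above.
coefs <- c(CLASSA2 = -0.018005, CLASSA3 = -0.047310,
           CLASSA4 = -0.075782, CLASSA5 = -0.117119,
           TYPEI   = -0.021093)

# Assumption: the response is log10(SHUCK), so a coefficient b multiplies
# SHUCK by 10^b relative to the baseline level (A1 for CLASS, ADULT for TYPE).
multiplier <- 10^coefs
pct_change <- round(100 * (multiplier - 1), 1)
print(pct_change)

# The L_VOLUME slope is an elasticity and does not depend on the log base:
# doubling VOLUME multiplies the expected SHUCK by 2^slope.
slope_volume <- 0.999303
doubling_factor <- 2^slope_volume
print(doubling_factor)
```

A slope so close to 1 means shuck weight scales almost proportionally with volume, while the CLASS and TYPE terms shift that proportional relationship downward for older classes and for infants.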



#### Section 5: ####

## Skewness: -0.05945234
## Kurtosis (rockchalk adjusted): -2.656692

## 
##  Bartlett test of homogeneity of variances
## 
## data:  residuals_model by CLASS
## Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498

The histogram in 5(a) suggests that the residuals are approximately normally distributed, since it is bell-shaped. The Q-Q plot supports this because most residuals fall close to the reference line. Together these indicate that the assumption of normally distributed residuals in the regression model is reasonably met. The scatterplot in 5(b) shows residuals plotted against L_VOLUME, with colors separating CLASS and TYPE. There is no visible pattern; a funnel or curve would have indicated changing variance or a missing term. The boxplots show that the residuals are spread evenly across groups. The Bartlett test result (p = 0.4498) shows no significant difference in variance across CLASS levels, which is consistent with constant variance. The model fits well, given the approximate normality of the residuals, the absence of strong patterns in the residual plots, and the homogeneous variance. These observations suggest that the regression model appropriately includes L_VOLUME, CLASS, and TYPE. Because of the strong linear relationship between L_SHUCK and L_VOLUME, L_VOLUME is a strong predictor of yield. Because the residuals show no significant violations of the regression assumptions, L_VOLUME is reliable for estimating shuck weight. By using L_VOLUME to predict L_SHUCK, harvesters can estimate the expected meat yield based on abalone size, which helps optimize collection efforts.


#### Section 6: ####

## [1] 526.6383
## [1] 0.2476573
## Proportion of adults harvested: 0.2476573
## Proportion of infants harvested: 0
## Median Infant Volume: 133.8214
## Proportion of infants harvested: 0.4982699
## Proportion of adults harvested: 0.9330656
## Median Adult Volume: 384.5584
## Proportion of infants harvested: 0.02422145
## Proportion of adults harvested: 0.4993307

The median values for infants and adults highlight the distinction between the two groups, with infants generally having smaller volumes. The median infant volume (133.82) is the size at which half of the infants are smaller and half are larger, so a cutoff at this value protects about half of the infants while allowing roughly 93% of adults to be harvested. Using the largest infant volume (526.64) as the cutoff ensures total infant protection but limits the harvest to about 25% of adults. A cutoff at the median adult volume (384.56) falls between these extremes: it is more restrictive than the median infant cutoff, harvesting only about half of the adults, but it reduces the infant harvest to about 2%. These values help establish a balanced approach to sustainable harvesting, reflecting the desired trade-off between conservation and yield.
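The proportions discussed above follow from a simple rule: an abalone is harvested when its volume exceeds the cutoff. A minimal sketch of that calculation is given here on simulated volumes; `infant_vol`, `adult_vol`, and `harvest_props()` are hypothetical names, and the simulated distributions are not fitted to the abalone data.

```r
# Hypothetical volumes for the two groups (simulated, right-skewed).
set.seed(6)
infant_vol <- rlnorm(300, meanlog = log(130), sdlog = 0.6)
adult_vol  <- rlnorm(700, meanlog = log(380), sdlog = 0.5)

# Proportion of each group with volume strictly above a cutoff.
harvest_props <- function(cutoff) {
  c(cutoff  = cutoff,
    infants = mean(infant_vol > cutoff),
    adults  = mean(adult_vol > cutoff))
}

# Three candidate cutoffs: largest infant, median infant, median adult.
cutoffs <- c(protect_all_infants = max(infant_vol),
             median_infant       = median(infant_vol),
             median_adult        = median(adult_vol))

res <- t(sapply(cutoffs, harvest_props))
print(round(res, 3))
```

Raising the cutoff can only lower both proportions, so every candidate is a point on the same trade-off curve between protecting infants and harvesting adults.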


#### Section 7 ####

## [1] 0.7255689
## [1] 0.7255689

#### Section 8: ####

## Smallest Volume Cutoff for Zero A1 Infant Harvest: 350.4548
## Proportion of Infants Exceeding Cutoff: 0.04152249
## Proportion of Adults Exceeding Cutoff: 0.5635877

#### Section 9: ####

## AUC: 0.1294961
## Poor discrimination (AUC <= 0.8).

#### Section 10

##     Cutoff       TPR       FPR Harvest_Proportion
## 1 526.6383 0.7523427 1.0000000          0.1785714
## 2 133.8214 0.0669344 0.5017301          0.8117761
## 3 384.5584 0.5006693 0.9757785          0.3667954
## 4 274.1740 0.2744311 0.8442907          0.5666023
## 5 350.4548 0.4364123 0.9584775          0.4179537
## AUC: 0.1294961
## Poor discrimination (AUC <= 0.8).

The highest cutoff guarantees that no infants are harvested. This cutoff leads to a low total harvest proportion of 0.18 and significantly reduces adult collection. In contrast, the lowest cutoff maximizes the harvest proportion at 0.81 but includes about half of the infants. The intermediate cutoffs offer more balanced choices between infant conservation and adult collection. The zero A1 infants cutoff is designed to avoid harvesting infants from CLASS A1. Choosing a cutoff depends on objectives: a higher cutoff supports sustainability by minimizing infant harvesting, while a lower one maximizes yield but risks depleting younger populations. ROC curve analysis indicates that an optimal threshold lies around the maximum difference cutoff, which best differentiates between adults and infants and seeks to minimize unintended harvesting of young abalone while sustaining industry viability.
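The ROC quantities referred to above can be computed directly from the two groups of volumes. The sketch below is an illustration on simulated data with hypothetical names (`infant_vol`, `adult_vol`, `trapz()`), not the analysis code behind the reported output. It treats "harvested" as volume above the cutoff and adults as the positive class, and it also shows what happens to the area under the curve when the direction is reversed and proportions at or below the cutoff are used instead.

```r
# Hypothetical volumes for the two groups (simulated).
set.seed(7)
infant_vol <- rlnorm(300, meanlog = log(130), sdlog = 0.6)
adult_vol  <- rlnorm(700, meanlog = log(380), sdlog = 0.5)

# Candidate cutoffs from 0 up to the largest observed volume.
cuts <- seq(0, max(c(infant_vol, adult_vol)), length.out = 2000)

# True positive rate: adults above the cutoff (harvested adults).
# False positive rate: infants above the cutoff (harvested infants).
tpr <- sapply(cuts, function(k) mean(adult_vol > k))
fpr <- sapply(cuts, function(k) mean(infant_vol > k))

# Trapezoidal area under a curve given as unordered (x, y) points.
trapz <- function(x, y) {
  o <- order(x, y)
  x <- x[o]
  y <- y[o]
  sum(diff(x) * (y[-length(y)] + y[-1]) / 2)
}

auc     <- trapz(fpr, tpr)
auc_rev <- trapz(1 - fpr, 1 - tpr)  # proportions at or below the cutoff
cat("AUC:", auc, " reversed direction:", auc_rev, "\n")

# Cutoff that maximizes the gap between adult and infant harvest rates.
best <- cuts[which.max(tpr - fpr)]
cat("Max-difference cutoff:", best, "\n")
```

The two areas always sum to 1, so an AUC well below 0.5 is usually a sign that the rates were computed in the reversed direction rather than evidence of a poor classifier.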

Trade-offs must be evaluated when deciding which volume cutoff to use for abalone harvesting. Of the five cutoffs, the “protect all infants” cutoff ensures that no infants are harvested, which supports long-term population sustainability. Nevertheless, this may not be a wise choice because it significantly reduces adult yield. On the other hand, selecting the median infant volume maximizes the adult harvest but risks overharvesting juveniles, which could harm the future population. The ROC peak (maximum difference) cutoff balances the separation between adults and infants, which can be a wise choice. The decision depends on the goals and on economic feasibility.

Several limitations need to be acknowledged. The dataset may not comprehensively represent the broader abalone population across distinct environmental conditions. Measurement inaccuracies in volume estimation can lead to misclassification, which reduces the effectiveness of any cutoff. The reported AUC of 0.129 is labeled as poor discrimination, but a value this far below 0.5 usually signals that the direction of the threshold was reversed in the calculation. The TPR and FPR columns in Section 10 are the complements of the harvest proportions in Section 6 (for example, 0.7523 = 1 - 0.2477). Reversing the direction would give 1 - 0.129 = 0.871, which would suggest good discrimination, so the ROC calculation should be rechecked before the AUC is relied on. A real-world implementation may also introduce complexities not accounted for in the data.

Given these limitations, it is vital to validate the proposed cutoffs with field studies and continuous monitoring before full-scale adoption. The max difference cutoff is a reasonable choice because it balances conservation and harvest yield. Improving volume measurement techniques and introducing standardized field protocols will reduce misclassification errors. Regular monitoring of harvested populations will make it possible to assess the impact of the selected cutoff and make necessary adjustments over time. In addition, a gradual transition in which initial cutoffs are tested on a smaller scale before widespread adoption will mitigate potential risks.

Expanding datasets to include broader geographic locations and environmental factors will help future abalone studies. Conducting longitudinal studies to track population changes over multiple harvesting cycles will improve accuracy. It is also wise to incorporate further biological and ecological variables, such as mortality, reproductive patterns, and growth rates, to refine cutoff recommendations. Lastly, using more advanced machine learning models to adjust cutoffs dynamically based on real-time data could improve the adaptability and precision of harvesting decisions.